As someone who is thinking about going into the field of Data Science and Machine Learning, I have looked through online job ads for data scientists and got overwhelmed by the amount of skills required. It would take a very long time to learn the whole repertoire of data science tools.
To help decide which tools to learn first, I decided to look at which data science skills are currently most in-demand by employers in major cities in Canada and U.S.
Just like any sane person would do these days, I googled “web scraping indeed” to see if I can scrape its job ads to analyse them. I came across this awesome blog by Jesse Steinweg-Woods published in March 2015. He looked at the data science tools mentioned in the job ads from indeed for Seattle, Chicago, San Francisco, New York and Nationwide.
His findings from nationwide search suggested that Python is much more in demand than R. This really intrigued me to see if the trends he found a year ago are still the same.
My initial goals were to 1. To examine the Data Science skills that most employers are looking for in Canada and U.S. 2. To examine the ranking of Canadian and American cities based on opportunities for Data Scientists
However, as I searched for job ads for data scientists in Canada, I came to realize that there weren’t enough ads for most of the cities. So I decided to focus on the national trend for Canada.
As for U.S., I initially wanted to look at the top 100 largest cities. During the data collection process, some of the cities didn’t have enough or no ads either. So I decided to focus on the following 53 cities, that had enough job ads for data scientists, instead.
While gathering data from Data-Scientists job ads, I started to wonder if the skills, that the employers are looking for, would be different for other data-driven jobs, specifically for Data Analysts, Data Architects and Data Engineers. So, I decided to also start collecting data for those as well.
I used indeed.com and indeed.ca to gather job ads.
I wanted to use rvest to scrape data from indeed. However, after spending a few days of trying, I couldn’t get the scraping function to work, since indeed is a job search aggregator and the links it provides go to different company websites that have different structure. This is my first time scraping so I probably didn’t know enough R tools to make it work.
So, I decided to see if I could modify Jesse’s ipython code to get .csv files of the wordcounts of the names of the Data Science Tools from the job ads. His code was written to produce plots of the results right after scraping, not produce .csv files.
You can find the ipython notebooks here.
You may see in the ipython notebooks that the code was a bit repetitive. I was trying to stick to d.r.y (don’t repeat yourself) principle but I have never programmed in Python, so this will be a work in progress for me to make the code neater. In the meantime, this code was enough to collect the data.
Data was collected for job ads that had exactly “Data Analyst”, “Data Scientist”, “Data Architect” or “Data Engineer” in its text. Since the search was done for each job title for each of the 53 cities mentioned above, I ended up with 212 (53 times 4) .csv files. I had to do quite a bit of wrangling to comebine them together.
Out of the 53 cities, 14 of them did not have enough (more than 10) or no job postings for data engineers. Therefore, I used 39 cities to compare the skills among “Data Anlysts”, “Data Scientists”, “Data Architects” and “Data Engineers”. I also used data from 53 cities to compare the skills among “Data Anlysts”, “Data Scientists” and “Data Architects”.
The data collected include the percentage of job ads that contain the names of the analysis tools or programming languages, specified below, and the number of job postings per each job title in each city.
I had separate .csv files of word count percentages (Data Science Tools) for DAnalysts, DScientists, DArchitects and DEngineer per city. I’m combining them to analyse the data.
library(readr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Joining datasets:
#NewYork
NewYork1 <- read_csv("NewYorkSkills1.csv") #.csv for Data Analysts' job ad data
NewYork2 <- read_csv("NewYorkSkills2.csv") #.csv for Data Scientists' job ad data
NewYork3 <- read_csv("NewYorkSkills3.csv") #.csv for Data Architects' job ad data
NewYork4 <- read_csv("NewYorkSkills4.csv") #.csv for Data Engineers' job ad data
names(NewYork1) <- c("Term", "DAnalyst")
names(NewYork2) <- c("Term", "DScientist")
names(NewYork3) <- c("Term", "DArchitect")
names(NewYork4) <- c("Term", "DEngineer")
j1 <- full_join(NewYork1, NewYork2, by="Term")
j2 <- full_join(NewYork3, NewYork4, by="Term")
NewYork <- full_join(j1, j2, by="Term")
NewYork$City <- "New York"
NewYork$State <- "NY"
NewYork <- replace_na(NewYork, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
#Since NAs mean the term didn't appear in the job postings and thus assigned zero.
write_csv( NewYork, path = "NewYork.csv")
#Anaheim
Anaheim1 <- read_csv("AnaheimSkills1.csv")
Anaheim2 <- read_csv("AnaheimSkills2.csv")
Anaheim3 <- read_csv("AnaheimSkills3.csv")
Anaheim4 <- read_csv("AnaheimSkills4.csv")
names(Anaheim1) <- c("Term", "DAnalyst")
names(Anaheim2) <- c("Term", "DScientist")
names(Anaheim3) <- c("Term", "DArchitect")
names(Anaheim4) <- c("Term", "DEngineer")
j1 <- full_join(Anaheim1, Anaheim2, by="Term")
j2 <- full_join(Anaheim3, Anaheim4, by="Term")
Anaheim <- full_join(j1, j2, by="Term")
Anaheim$City <- "Anaheim"
Anaheim$State <- "CA"
Anaheim <- replace_na(Anaheim, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv( Anaheim, path = "Anaheim.csv")
#Arlington
Arlington1 <- read_csv("ArlingtonSkills1.csv")
Arlington2 <- read_csv("ArlingtonSkills2.csv")
Arlington3 <- read_csv("ArlingtonSkills3.csv")
Arlington4 <- read_csv("ArlingtonSkills4.csv")
names(Arlington1) <- c("Term", "DAnalyst")
names(Arlington2) <- c("Term", "DScientist")
names(Arlington3) <- c("Term", "DArchitect")
names(Arlington4) <- c("Term", "DEngineer")
j1 <- full_join(Arlington1, Arlington2, by="Term")
j2 <- full_join(Arlington3, Arlington4, by="Term")
Arlington <- full_join(j1, j2, by="Term")
Arlington$City <- "Arlington"
Arlington$State <- "TX"
Arlington <- replace_na(Arlington, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv( Arlington, path = "Arlington.csv")
#Atlanta
Atlanta1 <- read_csv("AtlantaSkills1.csv")
Atlanta2 <- read_csv("AtlantaSkills2.csv")
Atlanta3 <- read_csv("AtlantaSkills3.csv")
Atlanta4 <- read_csv("AtlantaSkills4.csv")
names(Atlanta1) <- c("Term", "DAnalyst")
names(Atlanta2) <- c("Term", "DScientist")
names(Atlanta3) <- c("Term", "DArchitect")
names(Atlanta4) <- c("Term", "DEngineer")
j1 <- full_join(Atlanta1, Atlanta2, by="Term")
j2 <- full_join(Atlanta3, Atlanta4, by="Term")
Atlanta <- full_join(j1, j2, by="Term")
Atlanta$City <- "Atlanta"
Atlanta$State <- "GA"
Atlanta <- replace_na(Atlanta, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv( Atlanta, path = "Atlanta.csv")
#Aurora
Aurora1 <- read_csv("AuroraSkills1.csv")
Aurora2 <- read_csv("AuroraSkills2.csv")
Aurora3 <- read_csv("AuroraSkills3.csv")
Aurora4 <- read_csv("AuroraSkills4.csv")
names(Aurora1) <- c("Term", "DAnalyst")
names(Aurora2) <- c("Term", "DScientist")
names(Aurora3) <- c("Term", "DArchitect")
names(Aurora4) <- c("Term", "DEngineer")
j1 <- full_join(Aurora1, Aurora2, by="Term")
j2 <- full_join(Aurora3, Aurora4, by="Term")
Aurora <- full_join(j1, j2, by="Term")
Aurora$City <- "Aurora"
Aurora$State <- "CO"
Aurora <- replace_na(Aurora, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv( Aurora, path = "Aurora.csv")
#Austin
Austin1 <- read_csv("AustinSkills1.csv")
Austin2 <- read_csv("AustinSkills2.csv")
Austin3 <- read_csv("AustinSkills3.csv")
Austin4 <- read_csv("AustinSkills4.csv")
names(Austin1) <- c("Term", "DAnalyst")
names(Austin2) <- c("Term", "DScientist")
names(Austin3) <- c("Term", "DArchitect")
names(Austin4) <- c("Term", "DEngineer")
j1 <- full_join(Austin1, Austin2, by="Term")
j2 <- full_join(Austin3, Austin4, by="Term")
Austin <- full_join(j1, j2, by="Term")
Austin$City <- "Austin"
Austin$State <- "TX"
Austin <- replace_na(Austin, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv( Austin, path = "Austin.csv")
#Baltimore
Baltimore1 <- read_csv("BaltimoreSkills1.csv")
Baltimore2 <- read_csv("BaltimoreSkills2.csv")
Baltimore3 <- read_csv("BaltimoreSkills3.csv")
Baltimore4 <- read_csv("BaltimoreSkills4.csv")
names(Baltimore1) <- c("Term", "DAnalyst")
names(Baltimore2) <- c("Term", "DScientist")
names(Baltimore3) <- c("Term", "DArchitect")
names(Baltimore4) <- c("Term", "DEngineer")
j1 <- full_join(Baltimore1, Baltimore2, by="Term")
j2 <- full_join(Baltimore3, Baltimore4, by="Term")
Baltimore <- full_join(j1, j2, by="Term")
Baltimore$City <- "Baltimore"
Baltimore$State <- "MD"
Baltimore <- replace_na(Baltimore, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Baltimore, path = "Baltimore.csv")
#Boston
Boston1 <- read_csv("BostonSkills1.csv")
Boston2 <- read_csv("BostonSkills2.csv")
Boston3 <- read_csv("BostonSkills3.csv")
Boston4 <- read_csv("BostonSkills4.csv")
names(Boston1) <- c("Term", "DAnalyst")
names(Boston2) <- c("Term", "DScientist")
names(Boston3) <- c("Term", "DArchitect")
names(Boston4) <- c("Term", "DEngineer")
j1 <- full_join(Boston1, Boston2, by="Term")
j2 <- full_join(Boston3, Boston4, by="Term")
Boston <- full_join(j1, j2, by="Term")
Boston$City <- "Boston"
Boston$State <- "MA"
Boston <- replace_na(Boston, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Boston, path = "Boston.csv")
#Boulder
Boulder1 <- read_csv("BoulderSkills1.csv")
Boulder2 <- read_csv("BoulderSkills2.csv")
Boulder3 <- read_csv("BoulderSkills3.csv")
Boulder4 <- read_csv("BoulderSkills4.csv")
names(Boulder1) <- c("Term", "DAnalyst")
names(Boulder2) <- c("Term", "DScientist")
names(Boulder3) <- c("Term", "DArchitect")
names(Boulder4) <- c("Term", "DEngineer")
j1 <- full_join(Boulder1, Boulder2, by="Term")
j2 <- full_join(Boulder3, Boulder4, by="Term")
Boulder <- full_join(j1, j2, by="Term")
Boulder$City <- "Boulder"
Boulder$State <- "CO"
Boulder <- replace_na(Boulder, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Boulder, path = "Boulder.csv")
#Chandler
a <- read_csv("ChandlerSkills1.csv")
b <- read_csv("ChandlerSkills2.csv")
c <- read_csv("ChandlerSkills3.csv")
d <- read_csv("ChandlerSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Chandler <- full_join(j1, j2, by="Term")
Chandler$City <- "Chandler"
Chandler$State <- "AZ"
Chandler <- replace_na(Chandler, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Chandler, path = "Chandler.csv")
#Charlotte
a <- read_csv("CharlotteSkills1.csv")
b <- read_csv("CharlotteSkills2.csv")
c <- read_csv("CharlotteSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Charlotte <- full_join(j1, c, by="Term")
Charlotte$City <- "Charlotte"
Charlotte$State <- "NC"
Charlotte <- replace_na(Charlotte, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Charlotte, path = "Charlotte.csv")
#Chicago
Chicago1 <- read_csv("ChicagoSkills1.csv")
Chicago2 <- read_csv("ChicagoSkills2.csv")
Chicago3 <- read_csv("ChicagoSkills3.csv")
Chicago4 <- read_csv("ChicagoSkills4.csv")
names(Chicago1) <- c("Term", "DAnalyst")
names(Chicago2) <- c("Term", "DScientist")
names(Chicago3) <- c("Term", "DArchitect")
names(Chicago4) <- c("Term", "DEngineer")
j1 <- full_join(Chicago1, Chicago2, by="Term")
j2 <- full_join(Chicago3, Chicago4, by="Term")
Chicago <- full_join(j1, j2, by="Term")
Chicago$City <- "Chicago"
Chicago$State <- "IL"
Chicago <- replace_na(Chicago, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Chicago, path = "Chicago.csv")
#Cincinnati
a <- read_csv("CincinnatiSkills1.csv")
b <- read_csv("CincinnatiSkills2.csv")
c <- read_csv("CincinnatiSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Cincinnati <- full_join(j1, c, by="Term")
Cincinnati$City <- "Cincinnati"
Cincinnati$State <- "OH"
Cincinnati <- replace_na(Cincinnati, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Cincinnati, path = "Cincinnati.csv")
#Columbus
a <- read_csv("ColumbusSkills1.csv")
b <- read_csv("ColumbusSkills2.csv")
c <- read_csv("ColumbusSkills3.csv")
d <- read_csv("ColumbusSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Columbus <- full_join(j1, j2, by="Term")
Columbus$City <- "Columbus"
Columbus$State <- "OH"
Columbus <- replace_na(Columbus, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Columbus, path = "Columbus.csv")
#Dallas
a <- read_csv("DallasSkills1.csv")
b <- read_csv("DallasSkills2.csv")
c <- read_csv("DallasSkills3.csv")
d <- read_csv("DallasSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Dallas <- full_join(j1, j2, by="Term")
Dallas$City <- "Dallas"
Dallas$State <- "TX"
Dallas <- replace_na(Dallas, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Dallas, path = "Dallas.csv")
#Denver
a <- read_csv("DenverSkills1.csv")
b <- read_csv("DenverSkills2.csv")
c <- read_csv("DenverSkills3.csv")
d <- read_csv("DenverSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Denver <- full_join(j1, j2, by="Term")
Denver$City <- "Denver"
Denver$State <- "CO"
Denver <- replace_na(Denver, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Denver, path = "Denver.csv")
#Detroit
a <- read_csv("DetroitSkills1.csv")
b <- read_csv("DetroitSkills2.csv")
c <- read_csv("DetroitSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Detroit <- full_join(j1, c, by="Term")
Detroit$City <- "Detroit"
Detroit$State <- "MI"
Detroit <- replace_na(Detroit, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Detroit, path = "Detroit.csv")
#Durham
a <- read_csv("DurhamSkills1.csv")
b <- read_csv("DurhamSkills2.csv")
c <- read_csv("DurhamSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Durham <- full_join(j1, c, by="Term")
Durham$City <- "Durham"
Durham$State <- "NC"
Durham <- replace_na(Durham, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Durham, path = "Durham.csv")
#Fremont
a <- read_csv("FremontSkills1.csv")
b <- read_csv("FremontSkills2.csv")
c <- read_csv("FremontSkills3.csv")
d <- read_csv("FremontSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Fremont <- full_join(j1, j2, by="Term")
Fremont$City <- "Fremont"
Fremont$State <- "CA"
Fremont <- replace_na(Fremont, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Fremont, path = "Fremont.csv")
#FortWorth
a <- read_csv("FortWorthSkills1.csv")
b <- read_csv("FortWorthSkills2.csv")
c <- read_csv("FortWorthSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
FortWorth <- full_join(j1, c, by="Term")
FortWorth$City <- "FortWorth"
FortWorth$State <- "TX"
FortWorth <- replace_na(FortWorth, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(FortWorth, path = "FortWorth.csv")
#Garland
a <- read_csv("GarlandSkills1.csv")
b <- read_csv("GarlandSkills2.csv")
c <- read_csv("GarlandSkills3.csv")
d <- read_csv("GarlandSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Garland <- full_join(j1, j2, by="Term")
Garland$City <- "Garland"
Garland$State <- "TX"
Garland <- replace_na(Garland, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Garland, path = "Garland.csv")
#Gilbert
a <- read_csv("GilbertSkills1.csv")
b <- read_csv("GilbertSkills2.csv")
c <- read_csv("GilbertSkills3.csv")
d <- read_csv("GilbertSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Gilbert <- full_join(j1, j2, by="Term")
Gilbert$City <- "Gilbert"
Gilbert$State <- "AZ"
Gilbert <- replace_na(Gilbert, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Gilbert, path = "Gilbert.csv")
#Glendale
a <- read_csv("GlendaleSkills1.csv")
b <- read_csv("GlendaleSkills2.csv")
c <- read_csv("GlendaleSkills3.csv")
d <- read_csv("GlendaleSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Glendale <- full_join(j1, j2, by="Term")
Glendale$City <- "Glendale"
Glendale$State <- "AZ"
Glendale <- replace_na(Glendale, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Glendale, path = "Glendale.csv")
#Hialeah
a <- read_csv("HialeahSkills1.csv")
b <- read_csv("HialeahSkills2.csv")
c <- read_csv("HialeahSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Hialeah <- full_join(j1, c, by="Term")
Hialeah$City <- "Hialeah"
Hialeah$State <- "FL"
Hialeah <- replace_na(Hialeah, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Hialeah, path = "Hialeah.csv")
#Houston
a <- read_csv("HoustonSkills1.csv")
b <- read_csv("HoustonSkills2.csv")
c <- read_csv("HoustonSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Houston <- full_join(j1, c, by="Term")
Houston$City <- "Houston"
Houston$State <- "TX"
Houston <- replace_na(Houston, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Houston, path = "Houston.csv")
#Irvine
a <- read_csv("IrvineSkills1.csv")
b <- read_csv("IrvineSkills2.csv")
c <- read_csv("IrvineSkills3.csv")
d <- read_csv("IrvineSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Irvine <- full_join(j1, j2, by="Term")
Irvine$City <- "Irvine"
Irvine$State <- "CA"
Irvine <- replace_na(Irvine, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Irvine, path = "Irvine.csv")
#Irving
a <- read_csv("IrvingSkills1.csv")
b <- read_csv("IrvingSkills2.csv")
c <- read_csv("IrvingSkills3.csv")
d <- read_csv("IrvingSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Irving <- full_join(j1, j2, by="Term")
Irving$City <- "Irving"
Irving$State <- "TX"
Irving <- replace_na(Irving, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Irving, path = "Irving.csv")
#Jersey
a <- read_csv("JerseySkills1.csv")
b <- read_csv("JerseySkills2.csv")
c <- read_csv("JerseySkills3.csv")
d <- read_csv("JerseySkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Jersey <- full_join(j1, j2, by="Term")
Jersey$City <- "Jersey"
Jersey$State <- "NJ"
Jersey <- replace_na(Jersey, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Jersey, path = "Jersey.csv")
#LongBeach
a <- read_csv("LongBeachSkills1.csv")
b <- read_csv("LongBeachSkills2.csv")
c <- read_csv("LongBeachSkills3.csv")
d <- read_csv("LongBeachSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
LongBeach <- full_join(j1, j2, by="Term")
LongBeach$City <- "LongBeach"
LongBeach$State <- "CA"
LongBeach <- replace_na(LongBeach, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(LongBeach, path = "LongBeach.csv")
#LosAngeles
a <- read_csv("LosAngelesSkills1.csv")
b <- read_csv("LosAngelesSkills2.csv")
c <- read_csv("LosAngelesSkills3.csv")
d <- read_csv("LosAngelesSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
LosAngeles <- full_join(j1, j2, by="Term")
LosAngeles$City <- "Los Angeles"
LosAngeles$State <- "CA"
LosAngeles <- replace_na(LosAngeles, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(LosAngeles, path = "LosAngeles.csv")
#Mesa
a <- read_csv("MesaSkills1.csv")
b <- read_csv("MesaSkills2.csv")
c <- read_csv("MesaSkills3.csv")
d <- read_csv("MesaSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Mesa <- full_join(j1, j2, by="Term")
Mesa$City <- "Mesa"
Mesa$State <- "AZ"
Mesa <- replace_na(Mesa, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Mesa, path = "Mesa.csv")
#Miami
a <- read_csv("MiamiSkills1.csv")
b <- read_csv("MiamiSkills2.csv")
c <- read_csv("MiamiSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Miami <- full_join(j1, c, by="Term")
Miami$City <- "Miami"
Miami$State <- "FL"
Miami <- replace_na(Miami, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Miami, path = "Miami.csv")
#Minneapolis
a <- read_csv("MinneapolisSkills1.csv")
b <- read_csv("MinneapolisSkills2.csv")
c <- read_csv("MinneapolisSkills3.csv")
d <- read_csv("MinneapolisSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Minneapolis <- full_join(j1, j2, by="Term")
Minneapolis$City <- "Minneapolis"
Minneapolis$State <- "MN"
Minneapolis <- replace_na(Minneapolis, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Minneapolis, path = "Minneapolis.csv")
#Nashville
a <- read_csv("NashvilleSkills1.csv")
b <- read_csv("NashvilleSkills2.csv")
c <- read_csv("NashvilleSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Nashville <- full_join(j1, c, by="Term")
Nashville$City <- "Nashville"
Nashville$State <- "TN"
Nashville <- replace_na(Nashville, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Nashville, path = "Nashville.csv")
#Newark
a <- read_csv("NewarkSkills1.csv")
b <- read_csv("NewarkSkills2.csv")
c <- read_csv("NewarkSkills3.csv")
d <- read_csv("NewarkSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Newark <- full_join(j1, j2, by="Term")
Newark$City <- "Newark"
Newark$State <- "NJ"
Newark <- replace_na(Newark, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Newark, path = "Newark.csv")
#Oakland
a <- read_csv("OaklandSkills1.csv")
b <- read_csv("OaklandSkills2.csv")
c <- read_csv("OaklandSkills3.csv")
d <- read_csv("OaklandSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Oakland <- full_join(j1, j2, by="Term")
Oakland$City <- "Oakland"
Oakland$State <- "CA"
Oakland <- replace_na(Oakland, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Oakland, path = "Oakland.csv")
#Philadelphia
a <- read_csv("PhiladelphiaSkills1.csv")
b <- read_csv("PhiladelphiaSkills2.csv")
c <- read_csv("PhiladelphiaSkills3.csv")
d <- read_csv("PhiladelphiaSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Philadelphia <- full_join(j1, j2, by="Term")
Philadelphia$City <- "Philadelphia"
Philadelphia$State <- "PA"
Philadelphia <- replace_na(Philadelphia, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Philadelphia, path = "Philadelphia.csv")
#Pittsburgh
a <- read_csv("PittsburghSkills1.csv")
b <- read_csv("PittsburghSkills2.csv")
c <- read_csv("PittsburghSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Pittsburgh <- full_join(j1, c, by="Term")
Pittsburgh$City <- "Pittsburgh"
Pittsburgh$State <- "PA"
Pittsburgh <- replace_na(Pittsburgh, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Pittsburgh, path = "Pittsburgh.csv")
#Phoenix
a <- read_csv("PhoenixSkills1.csv")
b <- read_csv("PhoenixSkills2.csv")
c <- read_csv("PhoenixSkills3.csv")
d <- read_csv("PhoenixSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Phoenix <- full_join(j1, j2, by="Term")
Phoenix$City <- "Phoenix"
Phoenix$State <- "AZ"
Phoenix <- replace_na(Phoenix, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Phoenix, path = "Phoenix.csv")
#Plano
a <- read_csv("PlanoSkills1.csv")
b <- read_csv("PlanoSkills2.csv")
c <- read_csv("PlanoSkills3.csv")
d <- read_csv("PlanoSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Plano <- full_join(j1, j2, by="Term")
Plano$City <- "Plano"
Plano$State <- "TX"
Plano <- replace_na(Plano, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Plano, path = "Plano.csv")
#Pittsburgh
a <- read_csv("PittsburghSkills1.csv")
b <- read_csv("PittsburghSkills2.csv")
c <- read_csv("PittsburghSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Pittsburgh <- full_join(j1, c, by="Term")
Pittsburgh$City <- "Pittsburgh"
Pittsburgh$State <- "PA"
Pittsburgh <- replace_na(Pittsburgh, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Pittsburgh, path = "Pittsburgh.csv")
#Portland
a <- read_csv("PortlandSkills1.csv")
b <- read_csv("PortlandSkills2.csv")
c <- read_csv("PortlandSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Portland <- full_join(j1, c, by="Term")
Portland$City <- "Portland"
Portland$State <- "OR"
Portland <- replace_na(Portland, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Portland, path = "Portland.csv")
#Raleigh
a <- read_csv("RaleighSkills1.csv")
b <- read_csv("RaleighSkills2.csv")
c <- read_csv("RaleighSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Raleigh <- full_join(j1, c, by="Term")
Raleigh$City <- "Raleigh"
Raleigh$State <- "NC"
Raleigh <- replace_na(Raleigh, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Raleigh, path = "Raleigh.csv")
#SanDiego
a <- read_csv("SanDiegoSkills1.csv")
b <- read_csv("SanDiegoSkills2.csv")
c <- read_csv("SanDiegoSkills3.csv")
d <- read_csv("SanDiegoSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
SanDiego <- full_join(j1, j2, by="Term")
SanDiego$City <- "SanDiego"
SanDiego$State <- "CA"
SanDiego <- replace_na(SanDiego, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(SanDiego, path = "SanDiego.csv")
#SanFrancisco
a <- read_csv("SanFranciscoSkills1.csv")
b <- read_csv("SanFranciscoSkills2.csv")
c <- read_csv("SanFranciscoSkills3.csv")
d <- read_csv("SanFranciscoSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
SanFrancisco <- full_join(j1, j2, by="Term")
SanFrancisco$City <- "San Francisco"
SanFrancisco$State <- "CA"
SanFrancisco <- replace_na(SanFrancisco, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(SanFrancisco, path = "SanFrancisco.csv")
#SanJose
a <- read_csv("SanJoseSkills1.csv")
b <- read_csv("SanJoseSkills2.csv")
c <- read_csv("SanJoseSkills3.csv")
d <- read_csv("SanJoseSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
SanJose <- full_join(j1, j2, by="Term")
SanJose$City <- "San Jose"
SanJose$State <- "CA"
SanJose <- replace_na(SanJose, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(SanJose, path = "SanJose.csv")
#SantaAna
a <- read_csv("SantaAnaSkills1.csv")
b <- read_csv("SantaAnaSkills2.csv")
c <- read_csv("SantaAnaSkills3.csv")
d <- read_csv("SantaAnaSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
SantaAna <- full_join(j1, j2, by="Term")
SantaAna$City <- "Santa Ana"
SantaAna$State <- "CA"
SantaAna <- replace_na(SantaAna, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(SantaAna, path = "SantaAna.csv")
#Scottsdale
a <- read_csv("ScottsdaleSkills1.csv")
b <- read_csv("ScottsdaleSkills2.csv")
c <- read_csv("ScottsdaleSkills3.csv")
d <- read_csv("ScottsdaleSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Scottsdale <- full_join(j1, j2, by="Term")
Scottsdale$City <- "Scottsdale"
Scottsdale$State <- "AZ"
Scottsdale <- replace_na(Scottsdale, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Scottsdale, path = "Scottsdale.csv")
#Seattle
a <- read_csv("SeattleSkills1.csv")
b <- read_csv("SeattleSkills2.csv")
c <- read_csv("SeattleSkills3.csv")
d <- read_csv("SeattleSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Seattle <- full_join(j1, j2, by="Term")
Seattle$City <- "Seattle"
Seattle$State <- "WA"
Seattle <- replace_na(Seattle, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Seattle, path = "Seattle.csv")
#St.Louis
a <- read_csv("St.LouisSkills1.csv")
b <- read_csv("St.LouisSkills2.csv")
c <- read_csv("St.LouisSkills3.csv")
d <- read_csv("St.LouisSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
St.Louis <- full_join(j1, j2, by="Term")
St.Louis$City <- "St. Louis"
St.Louis$State <- "MO"
St.Louis <- replace_na(St.Louis, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(St.Louis, path = "St.Louis.csv")
#St.Paul
a <- read_csv("St.PaulSkills1.csv")
b <- read_csv("St.PaulSkills2.csv")
c <- read_csv("St.PaulSkills3.csv")
d <- read_csv("St.PaulSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
St.Paul <- full_join(j1, j2, by="Term")
St.Paul$City <- "St. Paul"
St.Paul$State <- "MN"
St.Paul <- replace_na(St.Paul, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(St.Paul, path = "St.Paul.csv")
#St.Petersburg
a <- read_csv("St.PetersburgSkills1.csv")
b <- read_csv("St.PetersburgSkills2.csv")
c <- read_csv("St.PetersburgSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
St.Petersburg <- full_join(j1, c, by="Term")
St.Petersburg$City <- "St.Petersburg"
St.Petersburg$State <- "FL"
St.Petersburg <- replace_na(St.Petersburg, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(St.Petersburg, path = "St.Petersburg.csv")
#Tampa
a <- read_csv("TampaSkills1.csv")
b <- read_csv("TampaSkills2.csv")
c <- read_csv("TampaSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Tampa <- full_join(j1, c, by="Term")
Tampa$City <- "Tampa"
Tampa$State <- "FL"
Tampa<- replace_na(Tampa, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Tampa, path = "Tampa.csv")
#Washington
a <- read_csv("WashingtonSkills1.csv")
b <- read_csv("WashingtonSkills2.csv")
c <- read_csv("WashingtonSkills3.csv")
d <- read_csv("WashingtonSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Washington <- full_join(j1, j2, by="Term")
Washington$City <- "Washington"
Washington$State <- "DC"
Washington <- replace_na(Washington, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Washington, path = "Washington.csv")
Combinging all cities:
master_39cities <- union(Anaheim, Arlington) %>% union(Atlanta) %>% union(Aurora) %>% union(Austin) %>% union(Baltimore) %>% union(Boston) %>% union(Boulder) %>% union(Chicago) %>% union(Dallas) %>% union(Denver) %>% union(Fremont) %>% union(Garland) %>% union(Irving) %>% union(Jersey) %>% union(LongBeach) %>% union(LosAngeles) %>% union(Minneapolis) %>% union(Newark) %>% union(NewYork) %>% union(Oakland) %>% union(Philadelphia) %>% union(Plano) %>% union(SanFrancisco) %>% union(SanJose) %>% union(SantaAna) %>% union(Seattle) %>% union(St.Louis) %>% union(St.Paul) %>% union(Washington) %>% union(Chandler) %>% union(Columbus) %>% union(Gilbert) %>% union(Glendale) %>% union(Irvine) %>% union(Mesa) %>% union(Phoenix) %>% union(SanDiego) %>% union(Scottsdale)
write_csv(master_39cities, path="master_39cities.csv")
master_53cities <- master_39cities %>% select(-DEngineer) %>% union(Charlotte) %>% union(Cincinnati) %>% union(Detroit) %>% union(Durham) %>% union(FortWorth) %>% union(Hialeah) %>% union(Houston) %>% union(Miami) %>% union(Nashville) %>% union(Pittsburgh) %>% union(Portland) %>% union(Raleigh) %>% union(St.Petersburg) %>% union(Tampa)
write_csv(master_53cities, path="master_53cities.csv")
After the lengthy process of wrangling and combining the data, I now have three .csv files:
master_39cities which include percentage of word frequency for all 4 job titles (Data Analysts, Data Scientists, Data Architects, Data Engineers) andmaster_53cities which include percentage of word frequency for 3 job titles (Data Analysts, Data Scientists, Data Architects)num_posting_US which include the number for job postings for each job title in each of the 53 citiesAfter the data collection process, I narrowed down my research questions to the following.
library(ggplot2)
library(ggrepel)
library(gridExtra)
library(stringr)
library(wordcloud)
library(RColorBrewer)
library(tm)
library(SnowballC)
Wrangling of num_posting data:
num_posting <- read_csv("num_posting_US.csv") #dataset that contains the number of job postings for each job title in each of the 53 cities
num_posting <- num_posting %>% separate(col=`City, State`, into=c("City", "State"), sep=",")
names(num_posting) <- c("City", "State", "DAnalyst", "DScientist", "DArchitect", "DEngineer")
num_posting <- replace_na(num_posting, list(DEngineer=0)) #Since the NAs for DEngineer are from cities that didn't have enough job postings (less than 10). For ease of analysis, I will consider them to be equal to zero.
num_posting_long <- num_posting %>% gather(job_title, num_posting, DAnalyst:DEngineer) #tidy data
num_posting %>%
ggplot() +
geom_line(aes(City, DAnalyst, group=1, color="Data Analyst")) +
geom_line(aes(City, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(City, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(City, DEngineer, group=1, color="Data Engineer")) +
ylab("Number of job postings") +
ggtitle("Number of job ads in 53 U.S. cities") +
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
You can see that almost all 53 cities had the most number of job ads for Data analysts followed by those for Data scientists, except for San Francisco, San Jose and Seattle.
There were similar number of job ads for data engineers and data architects, except for New York, Newark, Jersey City, San Jose, San Francisco, Seattle, Oakland and Fremont where there was a significantly higher number of job ads for data engineers than data architects.
num_posting_long %>%
group_by(City) %>%
summarize(total_num_posting = sum(num_posting)) %>%
ggplot(aes(City, total_num_posting))+
geom_point(color="purple") +
ylab("Number of job postings") +
ggtitle("Total number of job postings for Data Analysts, Data Scientists, Data Architects & Data Engineers") +
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
legend.text = element_text(size = 20))
In San Francisco, Newark, New York, Jersey City, Fremont, Oakland and Washington, there were a total of over 1000 job ads for Data Analysts, Data Scientists, Data Architects & Data Engineers combined together. Boston & Seattle had over 500 whereas San Jose had over 800.
Creating a master dataset for Canada:
canada1 <- read_csv("canada1.csv")
canada2 <- read_csv("canada2.csv")
canada3 <- read_csv("canada3.csv")
canada4 <- read_csv("canada4.csv")
names(canada1) <- c("Term", "DAnalyst")
names(canada2) <- c("Term", "DScientist")
names(canada3) <- c("Term", "DArchitect")
names(canada4) <- c("Term", "DEngineer")
j1 <- full_join(canada1, canada2, by="Term")
j2 <- full_join(canada3, canada4, by="Term")
canada <- full_join(j1, j2, by="Term")
canada <- replace_na(canada, replace=list(DAnalyst=0, DScientist=0, DArchitect=0, DEngineer=0))
canada_long <- canada %>% gather(job_title, percentage, DAnalyst: DEngineer) #tidy
canada %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst")) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
xlab("") +
ylab("Percentage appearing in job ads") +
ggtitle("Data Science Tools in demand in Canada") +
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
For data analysts, 38.3% of job ads asked for SQL, 34.2% for Excel, 11.4% for SAS, 11.2% for R, 10.9% for Tableau and only 7.1% for Python. (US_avg)
For data analysts, 43.2% of job ads asked for SQL, 41.18% for Excel, 16.1% for R, 16.1% for Tableau, 12.3% for Python and 10.5% for SAS.
canada %>% select(Term, DAnalyst) %>% arrange(desc(DAnalyst))
## Source: local data frame [27 x 2]
##
## Term DAnalyst
## (chr) (dbl)
## 1 SQL 43.222506
## 2 Excel 41.176471
## 3 R 16.112532
## 4 Tableau 16.112532
## 5 Python 12.276215
## 6 SAS 10.485934
## 7 JavaScript 10.230179
## 8 Perl 8.695652
## 9 Java 8.439898
## 10 Hadoop 6.649616
## .. ... ...
For data scientists, 47.8% of job ads asked for R, 44.1% for Python, 36.3% for Hadoop, 31.4 for SQL, 28.2% for Java, 26.6% for SAS and only 8.7% for Excel. (US_avg)
For data scientists, 34.4% of job ads asked for R, 34.4% for Python, 30.5% for SQL, 23.4% for Java, 19.5% for Tableau, 18.8% for Hadoop and 11.7% for Excel.
canada %>% select(Term, DScientist) %>% arrange(desc(DScientist))
## Source: local data frame [27 x 2]
##
## Term DScientist
## (chr) (dbl)
## 1 R 34.37500
## 2 Python 34.37500
## 3 SQL 30.46875
## 4 Java 23.43750
## 5 Tableau 19.53125
## 6 Hadoop 18.75000
## 7 SAS 17.96875
## 8 Spark 17.18750
## 9 JavaScript 13.28125
## 10 Excel 11.71875
## .. ... ...
For data architects, 45.3% asked for SQL, 32.9% for Hadoop, 31.7% for Java, 21.8% for NoSQL, 11.5% for R and 10.6% for Python. (avg_US)
For data architects, 32.7% asked for SQL, 13.9% for Java, 13.9% for Hadoop, 11.9% for Tableau, 11.9% for Python and only 5.9% for R.
canada %>% select(Term, DArchitect) %>% arrange(desc(DArchitect))
## Source: local data frame [27 x 2]
##
## Term DArchitect
## (chr) (dbl)
## 1 SQL 32.673267
## 2 Java 13.861386
## 3 Hadoop 13.861386
## 4 Tableau 11.881188
## 5 Python 11.881188
## 6 JavaScript 9.900990
## 7 C++ 7.920792
## 8 Hive 7.920792
## 9 SAS 6.930693
## 10 R 5.940594
## .. ... ...
For data engineers, 58.6% asked for Hadoop, 55.8% for Java, 44.6% for Spark, 31.2% for SQL, 27.6% for Python and only 7.2% for R.
For data engineers, 66.7% asked for Python, 63.0% for Hadoop, 40.7% for Spark, 37.0% for SQL & Scala, and 18.5% for Java.
canada %>% select(Term, DEngineer) %>% arrange(desc(DEngineer))
## Source: local data frame [27 x 2]
##
## Term DEngineer
## (chr) (dbl)
## 1 Python 66.66667
## 2 Hadoop 62.96296
## 3 Spark 40.74074
## 4 SQL 37.03704
## 5 Scala 37.03704
## 6 Hive 33.33333
## 7 Java 18.51852
## 8 HBase 18.51852
## 9 NoSQL 14.81481
## 10 Pig 14.81481
## .. ... ...
It’s really interesting to see that the top 2 most-mentioned tools for data analysts are SQL & Excel whereas those for data scientists are R & Python. The percentage of job ads that asked for SQL was pretty consistent across all four job titles. Same goes for Hadoop, except in job ads for Data analysts where it was asked only 4.9% of the time.
Creating a master dataset for USA nationwide count:
usa1 <- read_csv("national1.csv")
usa2 <- read_csv("national2.csv")
usa3 <- read_csv("national3.csv")
usa4 <- read_csv("national4.csv")
names(usa1) <- c("Term", "DAnalyst")
names(usa2) <- c("Term", "DScientist")
names(usa3) <- c("Term", "DArchitect")
names(usa4) <- c("Term", "DEngineer")
j1 <- full_join(usa1, usa2, by="Term")
j2 <- full_join(usa3, usa4, by="Term")
usa <- full_join(j1, j2, by="Term")
USA <- replace_na(usa, replace=list(DAnalyst=0, DScientist=0, DArchitect=0, DEngineer=0))
USA_long <- usa %>% gather(job_title, percentage, DAnalyst: DEngineer) #tidy
USA %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst")) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
xlab("") +
ylab("Percentage appearing in job ads") +
ggtitle("Data Science Tools in demand in USA") +
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
For data analysts, 38.3% of job ads asked for SQL, 34.2% for Excel, 11.4% for SAS, 11.2% for R, 10.9% for Tableau and only 7.1% for Python. (US_avg)
For data analysts, 43.2% of job ads asked for SQL, 41.18% for Excel, 16.1% for R, 16.1% for Tableau, 12.3% for Python and 10.5% for SAS. (Canada)
For data analysts, 56.2% of job ads asked for Excel, 27.5% for SQL, 12.5% for Tableau, 12.4% for R, 10.3% for Java and 5.5% for SAS.
USA %>% select(Term, DAnalyst) %>% arrange(desc(DAnalyst))
## Source: local data frame [30 x 2]
##
## Term DAnalyst
## (chr) (dbl)
## 1 Excel 56.221854
## 2 SQL 27.470965
## 3 Tableau 12.491112
## 4 R 12.396302
## 5 Java 10.263096
## 6 SAS 5.451529
## 7 Hadoop 3.602749
## 8 Python 2.939085
## 9 SPSS 2.844276
## 10 JavaScript 2.630955
## .. ... ...
For data scientists, 47.8% of job ads asked for R, 44.1% for Python, 36.3% for Hadoop, 31.4 for SQL, 28.2% for Java, 26.6% for SAS and only 8.7% for Excel. (US_avg)
For data scientists, 34.4% of job ads asked for R, 34.4% for Python, 30.5% for SQL, 23.4% for Java, 19.5% for Tableau, 18.8% for Hadoop and 11.7% for Excel.
For data scientists, 37.7% of job ads asked for Python, 36.6% for R, 29.4% for SQL, 27.0% for Hadoop, 24.9% for Java, 18.7% for Matlab, 12.8% for SAS and only 4.6% for Excel.
USA %>% select(Term, DScientist) %>% arrange(desc(DScientist))
## Source: local data frame [30 x 2]
##
## Term DScientist
## (chr) (dbl)
## 1 Python 37.65892
## 2 R 36.56291
## 3 SQL 29.41692
## 4 Hadoop 27.04954
## 5 Java 24.85752
## 6 Matlab 18.67602
## 7 Spark 13.15213
## 8 SAS 12.80140
## 9 Hive 12.71372
## 10 C++ 11.74923
## .. ... ...
For data architects, 45.3% asked for SQL, 32.9% for Hadoop, 31.7% for Java, 21.8% for NoSQL, 11.5% for R and 10.6% for Python. (avg_US)
For data architects, 32.7% asked for SQL, 13.9% for Java, 13.9% for Hadoop, 11.9% for Tableau, 11.9% for Python and only 5.9% for R.
For data architects, 45.2% asked for SQL, 18.0% for Java, 17.2% for Java, 16.1% for Tableau, 9.1% for Python, 7.9% for R.
USA %>% select(Term, DArchitect) %>% arrange(desc(DArchitect))
## Source: local data frame [30 x 2]
##
## Term DArchitect
## (chr) (dbl)
## 1 SQL 45.209581
## 2 Hadoop 17.964072
## 3 Java 17.215569
## 4 Tableau 16.092814
## 5 Spark 10.179641
## 6 NoSQL 10.029940
## 7 Python 9.131737
## 8 Hive 8.532934
## 9 R 7.859281
## 10 JavaScript 7.784431
## .. ... ...
For data engineers, 58.6% asked for Hadoop, 55.8% for Java, 44.6% for Spark, 31.2% for SQL, 27.6% for Python and only 7.2% for R.
For data engineers, 66.7% asked for Python, 63.0% for Hadoop, 40.7% for Spark, 37.0% for SQL & Scala, and 18.5% for Java.
For data engineers, 47.5% asked for SQL, 36.6% for Hadoop, 30.6% for Python, 28.7% for Java and 9.1% for R.
USA %>% select(Term, DEngineer) %>% arrange(desc(DEngineer))
## Source: local data frame [30 x 2]
##
## Term DEngineer
## (chr) (dbl)
## 1 SQL 47.48031
## 2 Hadoop 36.53543
## 3 Python 30.62992
## 4 Java 28.66142
## 5 Spark 20.78740
## 6 Hive 17.79528
## 7 NoSQL 12.99213
## 8 HBase 11.96850
## 9 Pig 11.10236
## 10 Scala 11.02362
## .. ... ...
It’s really interesting to see that the top 2 most-mentioned tools for data analysts are SQL & Excel whereas those for data scientists are R & Python. The percentage of job ads that asked for SQL was pretty consistent across all four job titles. Same goes for Hadoop, except in job ads for Data analysts where it was asked only 4.9% of the time.
Loading the datasets with the individual cities’ data:
master_39cities <- read_csv("master_39cities.csv")
master_53cities <- read_csv("master_53cities.csv")
master_39cities_long <- master_39cities %>% gather(job_title, percentage, DAnalyst: DEngineer) #tidy
master_53cities_long <- master_53cities %>% gather(job_title, percentage, DAnalyst: DArchitect) #tidy
Below, I’m calculating the average appearance of the term (name of Data Science Tools) in job postings gathered.
avg_appearance_39 <- master_39cities %>%
group_by(Term) %>%
summarize(avg_DAnalyst = mean(DAnalyst), avg_DScientist = mean(DScientist),
avg_DArchitect = mean(DArchitect), avg_DEngineer = mean(DEngineer))
avg_appearance_39 %>%
ggplot() +
geom_line(aes(Term, avg_DAnalyst, group=1, color="Data Analyst")) +
geom_line(aes(Term, avg_DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, avg_DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, avg_DEngineer, group=1, color="Data Engineer")) +
xlab("") +
ylab("Average percentage of appearance in job ads") +
ggtitle("Data Science tools most common in job ads (39 U.S. cities)")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
For data analysts, 38.3% of job ads asked for SQL, 34.2% for Excel, 11.4% for SAS, 11.2% for R, 10.9% for Tableau and only 7.1% for Python.
avg_appearance_39 %>% select(Term, avg_DAnalyst) %>% arrange(desc(avg_DAnalyst))
## Source: local data frame [30 x 2]
##
## Term avg_DAnalyst
## (chr) (dbl)
## 1 SQL 38.329201
## 2 Excel 34.290584
## 3 SAS 11.357391
## 4 R 11.196742
## 5 Tableau 10.968081
## 6 Python 7.139954
## 7 JavaScript 6.549579
## 8 Java 6.119270
## 9 Hadoop 4.987034
## 10 SPSS 3.666810
## .. ... ...
For data scientists, 47.8% of job ads asked for R, 44.1% for Python, 36.3% for Hadoop, 31.4 for SQL, 28.2% for Java, 26.6% for SAS and only 8.7% for Excel.
avg_appearance_39 %>% select(Term, avg_DScientist) %>% arrange(desc(avg_DScientist))
## Source: local data frame [30 x 2]
##
## Term avg_DScientist
## (chr) (dbl)
## 1 R 47.77967
## 2 Python 44.11741
## 3 Hadoop 36.30741
## 4 SQL 31.42600
## 5 Java 28.16051
## 6 SAS 26.59200
## 7 Spark 17.80438
## 8 Hive 16.76739
## 9 Matlab 16.08920
## 10 Tableau 15.07856
## .. ... ...
For data architects, 45.3% asked for SQL, 32.9% for Hadoop, 31.7% for Java, 21.8% for NoSQL, 11.5% for R and 10.6% for Python.
avg_appearance_39 %>% select(Term, avg_DArchitect) %>% arrange(desc(avg_DArchitect))
## Source: local data frame [30 x 2]
##
## Term avg_DArchitect
## (chr) (dbl)
## 1 SQL 45.26388
## 2 Hadoop 32.86227
## 3 Java 31.69227
## 4 NoSQL 21.76291
## 5 Hive 14.17215
## 6 Pig 11.92220
## 7 R 11.55942
## 8 Cassandra 11.02397
## 9 Spark 10.65855
## 10 Python 10.56799
## .. ... ...
For data engineers, 58.6% asked for Hadoop, 55.8% for Java, 44.6% for Spark, 31.2% for SQL, 27.6% for Python and only 7.2% for R.
avg_appearance_39 %>% select(Term, avg_DEngineer) %>% arrange(desc(avg_DEngineer))
## Source: local data frame [30 x 2]
##
## Term avg_DEngineer
## (chr) (dbl)
## 1 Hadoop 58.60279
## 2 Java 55.79221
## 3 Spark 44.61854
## 4 Hive 38.81320
## 5 HBase 37.50008
## 6 Cassandra 31.62938
## 7 SQL 31.17291
## 8 Pig 30.94266
## 9 MapReduce 30.91452
## 10 Python 27.63428
## .. ... ...
It’s really interesting to see that the top 2 most-mentioned tools for data analysts are SQL & Excel whereas those for data scientists are R & Python. The percentage of job ads that asked for SQL was pretty consistent across all four job titles. Same goes for Hadoop, except in job ads for Data analysts where it was asked only 4.9% of the time.
#I"m also checking if the trends are any different when looked at 53 cities. Since the dataset does not have the numbers for data engineers, I'll only be looking at data analysts, data scientists & data architects.
avg_appearance_53 <- master_53cities %>% group_by(Term) %>% summarize(avg_DAnalyst = mean(DAnalyst), avg_DScientist = mean(DScientist), avg_DArchitect = mean(DArchitect))
avg_appearance_53 %>%
ggplot() +
geom_line(aes(Term, avg_DAnalyst, group=1, color="Data Analyst")) +
geom_line(aes(Term, avg_DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, avg_DArchitect, group=1, color="Data Architect")) +
xlab("") +
ylab("Average percentage of appearance in job ads") +
ggtitle("Data Science tools most common in job ads (53 U.S. cities)")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
You can observe similar trends for the 53 cities as well.
avg_appearance_53 %>% select(Term, avg_DAnalyst) %>% arrange(desc(avg_DAnalyst))
## Source: local data frame [30 x 2]
##
## Term avg_DAnalyst
## (chr) (dbl)
## 1 SQL 40.192633
## 2 Excel 34.378379
## 3 SAS 11.604102
## 4 Tableau 10.653617
## 5 R 9.524691
## 6 Python 6.890490
## 7 JavaScript 6.689093
## 8 Java 5.842706
## 9 Hadoop 4.882816
## 10 SPSS 3.551104
## .. ... ...
avg_appearance_53 %>% select(Term, avg_DScientist) %>% arrange(desc(avg_DScientist))
## Source: local data frame [30 x 2]
##
## Term avg_DScientist
## (chr) (dbl)
## 1 R 44.95279
## 2 Python 42.06721
## 3 Hadoop 32.43743
## 4 SQL 30.70055
## 5 SAS 27.11487
## 6 Java 24.88270
## 7 Matlab 18.85610
## 8 Tableau 17.41630
## 9 Hive 17.38719
## 10 Spark 15.80257
## .. ... ...
avg_appearance_53 %>% select(Term, avg_DArchitect) %>% arrange(desc(avg_DArchitect))
## Source: local data frame [30 x 2]
##
## Term avg_DArchitect
## (chr) (dbl)
## 1 SQL 47.65299
## 2 Hadoop 31.11683
## 3 Java 28.27701
## 4 NoSQL 22.02256
## 5 Hive 15.46513
## 6 Pig 12.77114
## 7 Cassandra 11.95777
## 8 Spark 11.28517
## 9 HBase 10.33732
## 10 R 10.33601
## .. ... ...
master_53cities %>%
ggplot(aes(Term, DAnalyst)) +
geom_boxplot(color="red") +
xlab("") +
ylab("Percentage of appearance in job ads") +
ggtitle("Data Science Tools in demand for Data Analysts")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
master_53cities %>%
ggplot(aes(Term, DScientist)) +
geom_boxplot(color="purple") +
xlab("") +
ylab("Percentage of appearance in job ads") +
ggtitle("Data Science Tools in demand for Data Scientists")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
master_53cities %>%
ggplot(aes(Term, DArchitect)) +
geom_boxplot(color="green") +
xlab("") +
ylab("Percentage of appearance in job ads") +
ggtitle("Data Science Tools in demand for Data Architects")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
master_39cities %>%
ggplot(aes(Term, DEngineer)) +
geom_boxplot(color="turquoise") +
xlab("") +
ylab("Percentage of appearance in job ads") +
ggtitle("Data Science Tools in demand for Data Engineers")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
master_39cities%>%
filter (City == "New York" | City=="Newark" | City =="Jersey" | City =="Washington") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
master_39cities%>%
filter (City == "Boston" | City=="Oakland" | City =="San Francisco" | City =="Fremont") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
master_39cities%>%
filter (City == "Los Angeles" | City=="Atlanta" | City =="Long Beach" | City =="Irving") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
master_39cities%>%
filter (City == "San Jose" | City=="Chicago" | City =="Anaheim" | City =="Dallas") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
master_39cities%>%
filter (City == "Garland" | City=="Plano" | City =="Arlington" | City =="Philadelphia") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
master_39cities%>%
filter (City == "Seattle" | City=="Minneapolis" | City =="St. Paul" | City =="Baltimore") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
master_39cities%>%
filter (City == "Denver" | City=="Santa Ana" | City =="St. Louis" | City =="Aurora") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
master_39cities%>%
filter (City == "Mesa" | City=="Phoenix" | City =="Scottsdale" | City =="Irvine") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
master_39cities%>%
filter (City == "Chandler" | City=="Gilbert" | City =="Glendale" | City =="Boulder") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
master_39cities%>%
filter (City == "Austin" | City=="Columbus" | City =="San Diego") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
load("data_scientist.rda")
stackoverflow <- data_scientist %>% select(JobId, Title, Tags, LocationString, Description, CountryCode, StateCode)
Description <- stackoverflow %>% select(Description)
count1 <- function(term, lang) { #Counting just one variation of the term
lang = str_count(Description$Description, term) %>% data.frame()
names(lang) <- "count"
lang <- lang %>% mutate(count=ifelse(count == 0, 0, 1))
#Scenerio 1: If the count is 0, the term didn't appear in the job ad.
#Scenerio 2: If the count is not equal to zero, the term appeared 1 or more times in the job ad.
#Since we are only interested in the proportion of job ads that contain the term, we will make the count = 1 for scenerio 2.
lang
}
count2 <- function(term1, term2, lang) { #Counting 2 variations of the term
a = str_count(Description$Description, term1) %>% data.frame()
names(a) <- "count"
b = str_count(Description$Description, term2) %>% data.frame()
names(b) <- "count"
lang <- a + b
lang <- lang %>% mutate(count=ifelse(count== 0, 0, 1))
lang
}
count3 <- function(term1, term2, term3, lang) { #Counting 3 variations of the term
a = str_count(Description$Description, term1) %>% data.frame()
names(a) <- "count"
b = str_count(Description$Description, term2) %>% data.frame()
names(b) <- "count"
c = str_count(Description$Description, term3) %>% data.frame()
names(c) <- "count"
lang <- a + b + c
lang <- lang %>% mutate(count=ifelse(count== 0, 0, 1))
lang
}
count7 <- function(term1, term2, term3, term4, term5, term6, term7, lang) { #Counting 7 variations of the term
a = str_count(Description$Description, term1) %>% data.frame()
names(a) <- "count"
b = str_count(Description$Description, term2) %>% data.frame()
names(b) <- "count"
c = str_count(Description$Description, term3) %>% data.frame()
names(c) <- "count"
d = str_count(Description$Description, term4) %>% data.frame()
names(d) <- "count"
e = str_count(Description$Description, term5) %>% data.frame()
names(e) <- "count"
f = str_count(Description$Description, term6) %>% data.frame()
names(f) <- "count"
g = str_count(Description$Description, term7) %>% data.frame()
names(g) <- "count"
lang <- a + b + c +d + e+ f + g
lang <- lang %>% mutate(count=ifelse(count== 0, 0, 1))
lang
}
count3_fixed3 <- function(term1, term2, term3, lang){ #Counting 3 fixed variations of the term
a <- grepl(term1, Description$Description) %>% data.frame()
#When I was trying to count R, `str_count` was counting all R's in words, not just the specific one separate word "R".
#Therefore, I switched to grepl, that allows me to do so.
names(a) <- "count"
a <- a %>% mutate(count=ifelse(count==TRUE, 1, 0))
b <- grepl(term2, Description$Description) %>% data.frame()
names(b) <- "count"
b <- b %>% mutate(count=ifelse(count==TRUE, 1, 0))
c <- grepl(term3, Description$Description) %>% data.frame()
names(c) <- "count"
c <- c %>% mutate(count=ifelse(count==TRUE, 1, 0))
lang <- a+b+c
lang <- lang %>% mutate(count=ifelse(count == 0, 0, 1))
lang
}
count6_fixed6 <- function(term1, term2, term3, term4, term5, term6, lang){ #Counting 6 fixed variations of the term
a <- grepl(term1, Description$Description) %>% data.frame()
names(a) <- "count"
a <- a %>% mutate(count=ifelse(count==TRUE, 1, 0))
b <- grepl(term2, Description$Description) %>% data.frame()
names(b) <- "count"
b <- b %>% mutate(count=ifelse(count==TRUE, 1, 0))
c <- grepl(term3, Description$Description) %>% data.frame()
names(c) <- "count"
c <- c %>% mutate(count=ifelse(count==TRUE, 1, 0))
d <- grepl(term4, Description$Description) %>% data.frame()
names(d) <- "count"
d <- d %>% mutate(count=ifelse(count==TRUE, 1, 0))
e <- grepl(term5, Description$Description) %>% data.frame()
names(e) <- "count"
e <- e %>% mutate(count=ifelse(count==TRUE, 1, 0))
f <- grepl(term6, Description$Description) %>% data.frame()
names(f) <- "count"
f <- f %>% mutate(count=ifelse(count==TRUE, 1, 0))
lang <- a+b+c+d+e+f
lang <- lang %>% mutate(count=ifelse(count == 0, 0, 1))
lang
}
count9_fixed5 <- function(term1, term2, term3, term4, #term1-5 are fixed words
term5, term6, term7, term8, term9, lang) {
a <- grepl(term1, Description$Description) %>% data.frame()
names(a) <- "count"
a <- a %>% mutate(count=ifelse(count==TRUE, 1, 0))
b <- grepl(term2, Description$Description) %>% data.frame()
names(b) <- "count"
b <- b %>% mutate(count=ifelse(count==TRUE, 1, 0))
c <- grepl(term3, Description$Description) %>% data.frame()
names(c) <- "count"
c <- c %>% mutate(count=ifelse(count==TRUE, 1, 0))
d <- grepl(term4, Description$Description) %>% data.frame()
names(d) <- "count"
d <- d %>% mutate(count=ifelse(count==TRUE, 1, 0))
e <- grepl(term5, Description$Description) %>% data.frame()
names(e) <- "count"
e <- e %>% mutate(count=ifelse(count==TRUE, 1, 0))
f = str_count(Description$Description, term6) %>% data.frame()
names(f) <- "count"
g = str_count(Description$Description, term7) %>% data.frame()
names(g) <- "count"
h = str_count(Description$Description, term8) %>% data.frame()
names(h) <- "count"
i = str_count(Description$Description, term9) %>% data.frame()
names(i) <- "count"
lang <- a + b + c +d + e+ f + g + h+ i
lang <- lang %>% mutate(count=ifelse(count== 0, 0, 1))
lang
}
stack_count that contains the columns that indicate if a particular job ad had the term or not (yes = 1, no =0)R <- count3_fixed3("\\<R\\>", "\\<R.\\>", "\\<R,\\>", R)
Python <- count2("Python", "python", Python)
Java <- count2("Java", "java", Java)
C_plus_plus <- count2("C\\+\\+", "c\\+\\+", C_plus_plus)
Ruby <- count2("Ruby", "ruby", Ruby)
Perl <- count2("Perl", "perl", Perl)
Matlab <- count2("Matlab", "matlab", Matlab)
Javascript <- count2("Javascript", "javascript", Javascript)
Scala <- count2("Scala", "scala", Scala)
Excel <- count2("Excel", "excel",Excel)
Tableau <- count2("Tableau","tableau", Tableau)
D3.js <- count2("D3.js","d3.js", D3.js)
SAS <- count2("SAS","sas", SAS)
SPSS <- count2("SPSS","spss", SPSS)
D3 <- count2("d3", "D3", D3)
Hadoop <- count2("Hadoop", "hadoop", Hadoop)
MapReduce <- count2("MapReduce","mapreduce", MapReduce)
Spark <- count2("Spark", "spark", Spark)
Pig <- count2("Pig","pig", Pig)
Hive <- count2("Hive", "hive", Hive)
Shark <- count2("Shark", "shark", Shark)
Oozie <- count2("Oozie", "oozie", Oozie)
Zookeeper <- count2("Zookeeper", "zookeeper", Zookeeper)
Flume <- count2("Flume", "flume", Flume)
Mahout <- count2("Mahout", "mahout", Mahout)
SQL <- count6_fixed6("\\<SQL\\>", "\\<sql\\>","\\<SQL.\\>", "\\<SQL,\\>", "\\<sql.\\>", "\\<sql,\\>", SQL)
NoSQL <- count2("NoSQL", "nosql", NoSQL)
HBase <- count2("HBase", "hbase", HBase)
Cassandra <- count2("Cassandra", "cassandra", Cassandra)
MongoDB <- count2("MongoDB","mongodb", MongoDB)
MachineLearning <- count2("Machine Learning", "machine learning", MachineLearning)
BigData <- count3("Big Data", "big data", "Big data", BigData)
PhD <- count3("PhD", "Ph.D", "Ph.D.", PhD)
Bachelors <- count7("Undergraduate", "undergraduate", "BA", "BS", "BSc", "Bachelor\'s", "Bachelors", Bachelors)
Masters <- count9_fixed5("\\<Ms\\>", "\\<MS\\>", "\\<M.S.\\>", "\\<MA\\>", "\\<MBA\\>", "MS degree","Master\'s", "Master", "Masters", Masters)
stack_count <- cbind(R, Python, Java, C_plus_plus, Ruby, Perl, Matlab, Javascript, Scala, Excel, Tableau, D3.js, SAS, SPSS, D3, Hadoop, MapReduce, Spark, Pig, Hive, Shark, Oozie, Zookeeper, Flume, Mahout, SQL, NoSQL, HBase, Cassandra, MongoDB, MachineLearning, BigData, PhD, Bachelors, Masters)
names(stack_count) <- c("R", "Python", "Java", "C_plus_plus", "Ruby", "Perl", "Matlab", "Javascript", "Scala", "Excel", "Tableau", "D3.js", "SAS", "SPSS", "D3", "Hadoop", "MapReduce", "Spark", "Pig", "Hive", "Shark", "Oozie", "Zookeeper", "Flume", "Mahout", "SQL", "NoSQL", "HBase", "Cassandra", "MongoDB", "MachineLearning", "BigData", "PhD", "Bachelors", "Masters")
stack_count <- cbind(stackoverflow, stack_count)
sum_count with columns term, count (total number job ads containg the term) & freq (% of job ads containing the term)sum_count <- stack_count %>%
summarize(R = sum(R), Python = sum(Python), Java= sum(Java),
`C++` = sum(C_plus_plus), Ruby= sum(Ruby), Perl= sum(Perl),
Matlab= sum(Matlab), Javascript= sum(Javascript), Scala= sum(Scala),
Excel= sum(Excel), Tableau= sum(Tableau), D3.js=sum(D3.js),
SAS=sum(SAS), SPSS=sum(SPSS), D3=sum(D3), Hadoop=sum(Hadoop),
MapReduce=sum(MapReduce),Spark=sum(Spark), Pig=sum(Pig), Hive=sum(Hive),
Shark=sum(Shark), Oozie=sum(Oozie), Zookeeper=sum(Zookeeper),
Flume=sum(Flume), Mahout=sum(Mahout), SQL=sum(SQL), NoSQL=sum(NoSQL),
HBase=sum(HBase), Cassandra=sum(Cassandra),
MongoDB=sum(MongoDB),MachineLearning=sum(MachineLearning),
BigData=sum(BigData), PhD=sum(PhD), Bachelors=sum(Bachelors),
Masters=sum(Masters)) %>%
tbl_df() %>%
gather(term, count, R:Masters) #count = number of job ads that include the `term`
sum_count <- sum_count %>% mutate(freq=count/362, count)
#freq = % of job ads that contain the `term`
pal2 <- brewer.pal(8,"Dark2")
wordcloud(sum_count$term, sum_count$freq, scale = c(4, 1.5), random.order = FALSE, colors=pal2, rot.per = 0, fixed.asp = FALSE)
Description1 <- Description %>%
mutate(Description= gsub("\\[|\\]", "", Description)) %>%
mutate(Description= gsub("looking", "", Description)) %>%
mutate(Description= gsub("looking", "", Description)) %>%
mutate(Description= gsub("position", "", Description)) %>%
mutate(Description= gsub("one", "", Description)) %>%
mutate(Description= gsub("rel=\"nofollow\"", "", Description)) %>%
mutate(Description= gsub("also", "", Description)) %>%
mutate(Description= gsub("company", "", Description)) %>%
mutate(Description= gsub("<li>experience", "", Description)) %>%
mutate(Description= gsub("well", "", Description)) %>%
mutate(Description= gsub("<li>work", "", Description)) %>%
mutate(Description= gsub("<li><span>", "", Description)) %>%
mutate(Description= gsub("<p>", "", Description)) %>%
mutate(Description= gsub("</p>", "", Description)) %>%
mutate(Description= gsub("<p><strong>", "", Description)) %>%
mutate(Description= gsub("<ul>", "", Description)) %>%
mutate(Description= gsub("</ul>", "", Description)) %>%
mutate(Description= gsub("<li>", "", Description)) %>%
mutate(Description= gsub("</li>", "", Description)) %>%
mutate(Description= gsub("&", "", Description)) %>%
mutate(Description= gsub(" ", "", Description)) %>%
mutate(Description= gsub("</strong>", "", Description)) %>%
mutate(Description= gsub("<strong>", "", Description)) %>%
mutate(Description= gsub("<span>", "", Description)) %>%
mutate(Description= gsub("<br>", "", Description)) %>%
mutate(Description= gsub("<em>", "", Description)) %>%
mutate(Description= gsub("</span>", "", Description)) %>%
mutate(Description= gsub("<a href", "", Description)) %>%
mutate(Description= gsub("=\"http://www", "", Description)) %>%
mutate(Description= gsub("&rsquo", "", Description)) %>%
mutate(Description= gsub("youll", "", Description)) %>%
mutate(Description= gsub("will", "", Description)) %>%
mutate(Description= gsub("within", "", Description)) %>%
mutate(Description= gsub("can", "", Description)) %>%
mutate(Description= gsub("using", "", Description)) %>%
mutate(Description= gsub("apply", "", Description)) %>%
mutate(Description= gsub("part", "", Description)) %>%
mutate(Description=gsub("analysing", "analyse", Description))
r_words <- c("right", "youll", "get", "next", "high", "bull","way", "etc", "didate", "based", "every", "take", "ndash", "real", "core", "key", "day", "able", "van", "time", "used", "highly", "years", "sets", "want", "seeking", "senior", "related", "join", "ability", "new", "teams", "role", "use", "like", "make", "across", "provide", "drive", "help", "work", "working", "including", "questions", "environment", "implement", "developing", "products", "responsible", "need", "closely", "recommendations", "just", "ing", "set", "andor")
review_text <- paste(Description1$Description, collapse=" ")
review_source <- VectorSource(review_text)
corpus <- Corpus(review_source)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, r_words)
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
head(frequency)
## data team experience business scientist learning
## 2249 564 382 378 344 326
words <- names(frequency)
wordcloud(words, frequency, scale=c(3.5,.7), min.freq=30, random.order=FALSE, colors=brewer.pal(8, "Dark2"), max.words = 250, rot.per = 0, fixed.asp = FALSE, random.color = TRUE)